Copyright © 1998-1999, by Trexar Technologies, Inc.
E-Mail: info@trexar.com
The MacHeadlines Site Format (MHSF) is the format MacHeadlines uses in order to extract headlines from any given Web Site.
To determine the way to extract information from a new site, follow these steps:
1. Point your favorite Web Browser at the News Site you're interested in.
2. Select one of the headlines in the site, and write it down.
3. Save the HTML source of that page as an HTML Source file on your disk.
4. Open the HTML Source in some Text Editor.
5. Search for the headline you wrote down in step 2.
6. Note the pattern of HTML tags in its vicinity, and use the MHSF to define the parse format for that site.
Optionally, you can ask me to help you out, and 10 times out of 11 you'll have an answer quickly.
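Steps 5 and 6 can be partly automated. The following is a minimal Python sketch, not part of MacHeadlines: it finds a headline in saved HTML source and lists the HTML tags that come just before it. The function name tags_before and its count parameter are hypothetical, chosen for illustration only.

```python
import re

def tags_before(html_source, headline, count=3):
    """Return the last `count` HTML tags that precede `headline`, or None."""
    pos = html_source.find(headline)
    if pos < 0:
        return None  # headline text not present in the saved source
    # Collect every tag that appears before the headline, keep the nearest ones.
    tags = re.findall(r"<[^>]*>", html_source[:pos])
    return tags[-count:]

source = '<UL><LI><A HREF="/a.html">Apple ships new Power Macs</A></UL>'
print(tags_before(source, "Apple ships new Power Macs", count=2))
```

The tags printed for a real page suggest which token descriptors to try; for the sample above, the nearest tags are <LI> followed by the link tag.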
MacHeadlines opens connections with Web Sites using the HTTP protocol, and expects to receive HTML content from the other side of the connection.
MacHeadlines uses the Site Format defined by the user in the Tracked Site Information dialog to extract headlines from the HTML stream.
This stream is logically divided into two types of tokens: either HTML tags or simple text.
The format defined by the user describes the token descriptor sequence after which a headline will appear.
In most cases the token descriptor sequence will end with an HTML link tag (<A HREF...), but this is not required.
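The division of the stream into tag tokens and text tokens can be pictured with a short Python sketch. This is an illustration only, not the MacHeadlines implementation; the function name tokenize, the (kind, value) pair layout, and the decision to drop whitespace-only text are all assumptions.

```python
import re

def tokenize(html):
    """Split HTML into ("tag", "<...>") and ("text", "...") tokens."""
    tokens = []
    # Either a complete <...> tag, or a run of characters containing no "<".
    for part in re.findall(r"<[^>]*>|[^<]+", html):
        if part.startswith("<"):
            tokens.append(("tag", part))
        elif part.strip():
            # Assumption: whitespace-only text between tags is not a token.
            tokens.append(("text", part.strip()))
    return tokens

for token in tokenize('<P><A HREF="x.html">Big news</A>\n'):
    print(token)
```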
The following token descriptor sequences are valid:
(<FONT )(<A HREF)
(<P)(<A HREF)
(<LI>)(<A HREF)
'Last'{<FONT}
(<P>)(<B>)
(<TR){<B>}
The sequences are defined using a number of token descriptors.
These descriptors are:
( and ) Parentheses
{ and } Braces (or, commonly, Curly Brackets)
' and ' Apostrophes (or, commonly, Single Quotes)
" and " Quotes (or, commonly, Double Quotes)
[ and ] Brackets - used by previous versions of MacHeadlines, now obsolete
All descriptors must be balanced: every opening parenthesis, brace or quote must be matched by the corresponding closing one. No spaces may appear between token descriptors.
The following token descriptor sequences are invalid:
(<FONT ){<A HREF)
(<P)(<A HREF
<LI>)(<A HREF)
'Last{<FONT}
(<P>)"<B>)
(<TR) {<B>}
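The balancing and spacing rules can be checked mechanically. The Python sketch below is a hedged illustration, not the parser MacHeadlines uses; the names CLOSERS and parse_site_format are hypothetical. It accepts the valid sequences listed earlier and rejects the invalid ones. The obsolete [ and ] brackets are simply rejected here.

```python
# Each opening character and the character that must close it.
CLOSERS = {"(": ")", "{": "}", "'": "'", '"': '"'}

def parse_site_format(site_format):
    """Return a list of (opener, pattern) pairs, or raise ValueError."""
    descriptors = []
    pos = 0
    while pos < len(site_format):
        opener = site_format[pos]
        if opener not in CLOSERS:
            # Covers stray spaces and text outside any descriptor.
            raise ValueError(f"unexpected {opener!r} at position {pos}")
        end = site_format.find(CLOSERS[opener], pos + 1)
        if end < 0:
            raise ValueError(f"unbalanced {opener!r} at position {pos}")
        descriptors.append((opener, site_format[pos + 1:end]))
        pos = end + 1
    return descriptors

print(parse_site_format("'Last'{<FONT}"))
```

For example, "(<TR) {<B>}" fails on the space at position 5, and "(<P)(<A HREF" fails because the second parenthesis is never closed.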
Once MacHeadlines starts receiving an HTML stream, it checks what the next token descriptor should be, and ignores the HTML stream until that token is encountered.
Once a successful match has been made, NewTicker checks what the next token descriptor is, and adapts its behaviour accordingly.
The meaning of the token descriptors is as follows:
Parentheses - start expecting the HTML tag within the parentheses, and if the next HTML tag is not it, restart from the first token descriptor.
Curly Brackets - wait for the HTML tag within the brackets, and ignore all other HTML tags encountered until the awaited HTML tag is encountered.
Single Quotes - start expecting the text (which isn't an HTML tag) within the quotes, and if the next text is not it, restart from the first token descriptor.
Double Quotes - wait for the text (which isn't an HTML tag) within the quotes, and ignore any other text encountered until the awaited text is encountered.
If the token descriptor just matched is the last one, MacHeadlines starts extracting the incoming text as a headline, and continues until the closing HTML tag that brackets it is encountered.
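The matching rules above amount to a small state machine. The Python sketch below is one possible reading of them, not the actual MacHeadlines code. It assumes the stream is already split into (kind, value) tokens and the Site Format into (opener, pattern) pairs; both layouts, and the names extract_headlines and collect, are hypothetical. It further assumes that matching is a case-insensitive prefix test, that tokens of the other kind are skipped, and that a token which fails a strict descriptor is re-tested against the first descriptor.

```python
import re

def collect(tokens, i, pattern, matched, site_url, headlines):
    """Gather text up to the closing tag; return the index where it stopped."""
    name = re.match(r"<(\w+)", pattern)
    # Assumption: if the last descriptor is text, stop at the next tag.
    closing = f"</{name.group(1).upper()}>" if name else None
    words = []
    while i < len(tokens):
        kind, value = tokens[i]
        if kind == "tag" and (closing is None or value.upper() == closing):
            break
        if kind == "text":
            words.append(value)
        i += 1
    # A link tag supplies the URL; otherwise fall back to the site's own URL.
    href = re.search(r'href\s*=\s*["\']?([^"\'\s>]+)', matched, re.IGNORECASE)
    headlines.append((" ".join(words), href.group(1) if href else site_url))
    return i

def extract_headlines(tokens, descriptors, site_url):
    """Return (headline, url) pairs found in the token stream."""
    headlines = []
    if not descriptors:
        return headlines
    step = 0  # index of the descriptor currently awaited
    i = 0
    while i < len(tokens):
        opener, pattern = descriptors[step]
        wanted_kind = "tag" if opener in ("(", "{") else "text"
        kind, value = tokens[i]
        if kind != wanted_kind:
            i += 1  # assumption: tokens of the other kind are skipped
        elif value.upper().startswith(pattern.upper()):
            step += 1
            i += 1
            if step == len(descriptors):
                i = collect(tokens, i, pattern, value, site_url, headlines)
                step = 0
        elif opener in ("(", "'") and step > 0:
            step = 0  # strict descriptor failed: restart, re-test this token
        else:
            i += 1  # waiting descriptor (or first one): ignore and move on
    return headlines

tokens = [("tag", "<FONT size=2>"), ("tag", '<A HREF="http://example.com/1">'),
          ("text", "Apple ships new Power Macs"), ("tag", "</A>"), ("tag", "</FONT>")]
print(extract_headlines(tokens, [("(", "<FONT "), ("(", "<A HREF")], "http://example.com/"))
```

Because matching is a plain prefix test in this sketch, a pattern such as <P would also match <PRE>; a stricter comparison may be needed for real pages.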
Given this, the valid token descriptor sequences described above have the following meanings:
(<FONT )(<A HREF) - Ignore the HTML tokens until the font-face-changing HTML tag <FONT ...> is encountered, then the next HTML tag encountered must be the HTML link tag <A HREF...>.
If this is so, extract the address of the link to be the URL launched when clicking the headline, and extract all the text up to the </A> (or </P> or <P>) HTML tag to be the required headline.
(<P)(<A HREF)
(<LI>)(<A HREF) - Same as the one above, except the initial trigger is <P...> or <LI> respectively.
'Last'{<FONT} - Ignore the incoming stream until the text "Last" is encountered. Once that text is encountered, ignore all HTML tags until the font-face-changing HTML tag <FONT ...> is encountered.
When (and if) this is so, extract all the text until the </FONT> tag, and make that the headline.
The URL to be launched when clicking the headline is the same URL used for pointing at the site.
(<P>)(<B>) - Ignore the incoming stream until the paragraph-open HTML tag <P> is encountered. Once that tag is encountered, the next HTML tag encountered must be the font-bolding HTML tag <B>.
When (and if) this is so, extract all the text until the </B> tag, and make that the headline.
The URL to be launched when clicking the headline is the same URL used for pointing at the site.
(<TR){<B>} - Ignore the incoming stream until the table-row HTML tag <TR...> is encountered. Once that tag is encountered, ignore all HTML tags until the font-bolding HTML tag <B> is encountered.
When (and if) this is so, extract all the text until the </B> tag, and make that the headline.
The URL to be launched when clicking the headline is the same URL used for pointing at the site.
Note: For most sites in which the headline of interest is an HTML link, the Site Format can simply be (<A HREF).
Note: For sites described using HTML link tags, MacHeadlines only presents headlines longer than 3 words. This is done to prevent headlines such as "Technology", "News", "Last Week" etc.
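The word-count rule in the last note can be expressed as a short Python check. This is a sketch only; the function name is_presentable and the min_words parameter are hypothetical, and words are assumed to be separated by whitespace.

```python
def is_presentable(headline, min_words=3):
    """Keep only headlines longer than `min_words` words."""
    return len(headline.split()) > min_words

for candidate in ["Technology", "Last Week", "Apple ships new Power Macs"]:
    print(candidate, is_presentable(candidate))
```

Under this reading, a headline of exactly 3 words is still filtered out, since the note says "longer than 3 words".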